Skip to content

Record batch machine command timings#1236

Open
karenychen wants to merge 7 commits into
Azure:mainfrom
karenychen:codex/batch-command-execution-times
Open

Record batch machine command timings#1236
karenychen wants to merge 7 commits into
Azure:mainfrom
karenychen:codex/batch-command-execution-times

Conversation

@karenychen

Copy link
Copy Markdown
Contributor

Summary

  • record per-batch BatchPutMachine command timings in AKSMachineClient
  • emit batch_command_execution_times as scale_machine operation metadata
  • cover metric capture and metadata emission in machine-client tests

Tests

  • PYTHONPATH=modules/python python3 -m pytest modules/python/tests/test_aks_machine_client.py modules/python/tests/test_machine_crud.py modules/python/tests/test_crud_main.py
  • git diff --check

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the AKS Machine scale-out path to capture per-batch BatchPutMachine timing metrics and emit them as operation metadata, with accompanying test updates to validate metric capture and metadata emission.

Changes:

  • Record per-batch start/end timestamps and execution duration in _create_batch_machines.
  • Emit batch_command_execution_times as scale_machine operation metadata when use_batch_api=True.
  • Update machine-client tests to cover timing metric capture and metadata emission.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
modules/python/clients/aks_machine_client.py Captures per-batch timing metrics and emits them as scale_machine operation metadata.
modules/python/tests/test_aks_machine_client.py Extends tests to assert timing metrics are captured and metadata is emitted for the batch path.
Comments suppressed due to low confidence (1)

modules/python/clients/aks_machine_client.py:921

  • execution_time_seconds is derived from datetime.now() deltas, which can go negative or be skewed if the system clock adjusts (e.g., NTP). For durations, use a monotonic timer (time.perf_counter() / time.monotonic()) and keep datetime only for the wall-clock timestamps.
        start_time = datetime.now(timezone.utc)
        self._make_batch_request(
            "PUT",
            url,
            body,

Comment thread modules/python/clients/aks_machine_client.py Outdated
@karenychen

Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and AGENTS.md compliance.

@xuexu6666

Copy link
Copy Markdown
Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@karenychen

Copy link
Copy Markdown
Contributor Author

Comment on lines +909 to +915
execution_time_seconds = (end_time - start_time).total_seconds()
request.batch_command_execution_times[first_machine_name] = {
"start_time": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
"end_time": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
"execution_time_seconds": execution_time_seconds,
"total_machines_in_batch": len(chunk),
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should record batch timing even when the BatchPutMachine request fails.

Right now the metric is written only after _make_batch_request() returns successfully. If it raises after retries / timeout / non-2xx, _scale_machine_batch() catches that worker failure and returns [], so the operation only reports landed X/Y machines and the failed batch has no timing entry. That makes the new metric biased toward successful batches and hides the slow/failing batch we most need for debugging.

Could we wrap the request in try/finally so every batch records elapsed time, and include a small status/error field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants